332:525 Homework Set 1
Estimation Problems

1. Recursive Least-Squares (RLS) Estimators: Consider a sequence of iid random variables x_n, n = 0, 1, 2, ..., and form the running average of the first n+1 numbers, considered as an estimate of the mean m = E[x_n]:

m̂_n = (x_0 + x_1 + ... + x_n)/(n+1)

(a) Show that m̂_n is the optimum solution that minimizes the sum of squares:

E_n = Σ_{k=0}^{n} (x_k - m̂)^2

What is the minimized value of E_n?

(b) Moreover, show that m̂_n can be re-expressed in the following time-recursive forms, with the second being a Kalman-type predictor/corrector form:

m̂_n = (n/(n+1)) m̂_{n-1} + (1/(n+1)) x_n
m̂_n = m̂_{n-1} + (1/(n+1)) (x_n - m̂_{n-1})

Note that these recursions connect the optimum least-squares solutions of two different performance indices. Indeed, m̂_n minimizes the performance index E_n, whereas m̂_{n-1} minimizes E_{n-1}, which runs up to time k = n-1.

(c) Show that m̂_n is an unbiased estimator of the mean. Determine the variance of m̂_n, that is, the quantity var(m̂_n) = E[(m̂_n - m)^2], and show that m̂_n is a consistent estimator of the mean. Hint: Show first that

m̂_n - m = (1/(n+1)) Σ_{k=0}^{n} (x_k - m)

and use the assumption that the x_n are iid, which implies the decorrelation condition E[(x_i - m)(x_j - m)] = σ_x^2 δ_ij.

2. RLS Estimators with Forgetting Factor: The RLS estimator m̂_n of the previous problem is appropriate for stationary sequences, that is, those whose statistical characteristics do not change over time. Indeed, the performance index E_n treats all time samples, from the earliest to the latest, on an equal footing. Initially, the estimator m̂_n converges very fast to the optimum value m and then gets stuck at that value, because the Kalman-type gain factor 1/(n+1) that appears in the time-update becomes extremely small with increasing n. If there is a non-stationary change in the statistics and the mean m changes to a new value, the estimator m̂_n will have a very hard time tracking this change.
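The two recursive forms in part (b) are easy to check numerically. A minimal sketch (the function and variable names are mine, not from the problem):

```python
# Check that the Kalman-type recursion m_n = m_{n-1} + (x_n - m_{n-1})/(n+1)
# reproduces the batch running average (x_0 + ... + x_n)/(n+1).
def recursive_mean(xs):
    m = 0.0
    estimates = []
    for n, x in enumerate(xs):
        m = m + (x - m) / (n + 1)   # predictor/corrector update
        estimates.append(m)
    return estimates

xs = [2.0, 4.0, 9.0, 1.0]
batch = [sum(xs[:n + 1]) / (n + 1) for n in range(len(xs))]
rec = recursive_mean(xs)
print(rec)   # matches the batch averages [2.0, 3.0, 5.0, 4.0]
```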
A more appropriate estimator for tracking non-stationary changes in the statistics would be one that places more emphasis on the more recent data and less on the older data. For example, the following weighted version of E_n emphasizes the current samples more and forgets the older ones exponentially fast:

E_n = Σ_{k=0}^{n} λ^{n-k} (x_k - m̂)^2

where the forgetting factor λ must satisfy 0 < λ ≤ 1. Note that λ = 1 recovers the above stationary case.

(a) Determine the optimum m̂_n that minimizes E_n and cast it in a time-recursive form such as:

m̂_n = m̂_{n-1} + b_n (x_n - m̂_{n-1})

How does b_n behave in the limit λ → 1? Show that m̂_n is an asymptotically unbiased estimator of m.

(b) Show that for fairly large values of n and for λ < 1, the estimator satisfies the first-order difference equation (otherwise known as a first-order smoother):

m̂_n = λ m̂_{n-1} + (1 - λ) x_n    (1)

3. RLS Estimators with Forgetting Factor: The first-order smoother estimator of Eq. (1) was obtained for fairly large values of n. However, it can be thought of as a third type of estimator in its own right. Assume, therefore, that Eq. (1) defines m̂_n for all n ≥ 0. Show that it is asymptotically unbiased but not consistent. Indeed, show that in the limit of large n, the variance of m̂_n tends to the finite value:
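The tracking behavior described above can be seen in a few lines. A sketch under assumptions of my choosing (jump size, λ value):

```python
# Sketch: the exponentially weighted estimate m_n = lam*m_{n-1} + (1-lam)*x_n
# tracks a jump in the mean, while the plain running average lags behind.
def ewma(xs, lam):
    m = xs[0]
    for x in xs[1:]:
        m = lam * m + (1 - lam) * x   # first-order smoother, Eq. (1)
    return m

xs = [0.0] * 50 + [10.0] * 50   # mean jumps from 0 to 10 at n = 50
m_fast = ewma(xs, lam=0.8)      # forgets old data quickly
m_avg = sum(xs) / len(xs)       # treats all samples equally
print(m_fast, m_avg)            # smoother is near 10, plain average is 5
```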
var(m̂_n) = E[(m̂_n - E[m̂_n])^2] → ((1 - λ)/(1 + λ)) σ_x^2

However, by choosing λ near 1 it can be made as small as desired, thus providing a good estimator. The trade-off is that the closer λ is to 1, the more sluggish the estimator becomes in tracking non-stationarities.

4. Least-Mean-Square (LMS) Estimators: Consider the theoretical performance index

E(m̂) = E[(x - m̂)^2]    (2)

(a) Differentiating it with respect to m̂, show that E is minimized for the optimum value of the parameter m̂ = m = E[x].

(b) The LMS algorithm is based on the idea of steepest descent, in which m̂ is changed iteratively so that at each iteration the performance index E is decreased, eventually reaching its minimum value. The key condition is to demand that going from one value of m̂ to the next, say m̂ + Δm̂, results in a smaller performance index, that is, E(m̂ + Δm̂) ≤ E(m̂). This can be guaranteed by choosing the change Δm̂ to be proportional to the negative gradient of E, that is (with μ > 0):

Δm̂ = -μ ∂E/∂m̂    (LMS update)

Replace the theoretical gradient by the instantaneous one:

∂E/∂m̂ → ∂E_n/∂m̂ = -2(x_n - m̂_n)

Apply the LMS update to the instantaneous gradient, that is,

m̂_{n+1} = m̂_n + Δm̂_n = m̂_n - μ ∂E_n/∂m̂_n

and show that it can be written in a form similar to the RLS estimator of Eq. (1):

m̂_{n+1} = λ m̂_n + (1 - λ) x_n

where λ = 1 - 2μ. Thus, the LMS and RLS algorithms for the recursive estimation of the mean are essentially equivalent. Note, however, that when adapting more than one parameter, the LMS and RLS algorithms are no longer equivalent, the latter having a much faster learning speed at the expense of higher computational cost.

5. Do Problems 1.9 and 1.10. For Problem 1.10, suppose the mixing parameter ɛ is known in advance. Instead of sending x and y into a correlation canceler, you carry out a preprocessing operation, replacing {x, y} by the signals {x', y'}, where x' = x and y' = y - ɛx, and then send those into a correlation canceler. Determine the optimum canceler weight H'.
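The λ = 1 - 2μ equivalence of Problem 4 can be verified directly. A sketch (step size μ and data chosen arbitrarily):

```python
# Sketch: the LMS update m <- m + 2*mu*(x - m), with lam = 1 - 2*mu,
# coincides step-by-step with the smoother m <- lam*m + (1-lam)*x.
mu = 0.05
lam = 1 - 2 * mu
xs = [1.0, -2.0, 0.5, 3.0, 1.5]

m_lms = m_rls = 0.0
for x in xs:
    m_lms = m_lms + 2 * mu * (x - m_lms)  # steepest-descent (LMS) form
    m_rls = lam * m_rls + (1 - lam) * x   # first-order smoother (RLS) form
print(m_lms, m_rls)  # essentially identical trajectories
```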
Show that now the noise component of x can be canceled completely. Draw a block diagram of all the processing operations.

Note: The circumstances of this problem arise in adaptive antenna sidelobe canceling systems that use linearly polarized antennas. Polarization is used as a discriminant between signal and interference. In this application, the parameter ɛ is related to the known polarization angles of the desired signal. The interference signal is also polarized, with unknown polarization angles relative to the antennas, but that does not matter, because the subsequent adaptive canceler determines them adaptively and cancels the interference completely.

6. (a) Let x̂ be the optimum linear estimate of a scalar x based on the random vector y. Show that x̂ remains invariant under a linear invertible transformation of the observation vector, y = Bz.

(b) Show that E[e x̂] = 0 and E[e^2] = E[e x], where e = x - x̂.

(c) If x is uncorrelated with y, show that x̂ = 0.

7. Let x be a random variable with mean E[x] = m. We wish to estimate x in terms of a zero-mean vector of observations y. Because the mean of x is not zero, we seek an estimate of the form

x̂ = h^T y + b

The b-term is called a bias term. Assume the correlations R = E[y y^T] and r = E[x y] are known. Show that the optimum choices for h and b that minimize the mean-square estimation error E = E[e^2], where e = x - x̂, are

h = R^{-1} r    and    b = m

Note: It is straightforward to reformulate such biased estimates adaptively. They are very common, especially in neural network applications.

8. (a) Show that the optimum estimate of y based on itself is itself, that is, ŷ = y.

(b) Let z = Qy, where Q does not have to be invertible or square. Show that the optimum estimate of z based on y is given by ẑ = Qy, that is, ẑ = z.
(c) Suppose y is divided into two subvectors y_1 and y_2, that is, y = [y_1; y_2]. Using the results of the previous part, or working directly, show that the optimum estimate of y_1 based on y, that is, ŷ_1 = E[y_1 y^T] E[y y^T]^{-1} y, is given simply by ŷ_1 = y_1.

9. (a) A random variable x is related to the random vectors y_1 and y_2 by

x = c_1^T y_1 + c_2^T y_2 + v = [c_1^T, c_2^T][y_1; y_2] + v = c^T y + v

where v is uncorrelated with y_1 and y_2. Show that the best estimate of x based on the combined observation vector y = [y_1; y_2] is given by x̂ = c^T y. Therefore, the y-dependent part of x is completely canceled from the error output e = x - x̂, that is, e = v in this case. (Hint: Show that the solution of the normal equations is h = c.)

(b) Determine the optimum estimate x̂_1 = h_1^T y_1 of x based only on the first observation vector y_1, and show that in this case the y_1-dependent part of x is still canceled completely from the error output e_1 = x - x̂_1, whereas the y_2-dependent part is canceled as much as possible, in the sense that e_1 is given by

e_1 = v + c_2^T (y_2 - ŷ_{2/1})

where

ŷ_{2/1} = E[y_2 y_1^T] E[y_1 y_1^T]^{-1} y_1 = R_21 R_11^{-1} y_1

is the best estimate of y_2 based on y_1. (Hint: Express h_1 in terms of c_1, c_2, R_11, R_21.)

(c) Show that the minimized mean-square error of the above case is given by:

E = E[e_1^2] = σ_v^2 + c_2^T (R_22 - R_21 R_11^{-1} R_21^T) c_2

where R_22 = E[y_2 y_2^T]. Why is the second term in E non-negative?

Note: The results of this problem will be used later to develop guidelines for picking the filter order in adaptive filtering applications.

10. Let

R = [[2, 4, 4], [4, 9, 10], [4, 10, 14]]    (rows listed in order)

be the covariance matrix of y = [y_1; y_2; y_3], assumed to have zero mean. Determine the innovations representation y = Bɛ by carrying out the Gram-Schmidt orthogonalization of the components of y. Then, verify the factorization R_yy = B R_ɛɛ B^T by explicit matrix multiplication. Next, consider the estimation of a random variable x in terms of y. The cross-correlation between x and y is known to be r = E[xy] = [4, 4, 2]^T.
Determine the optimum estimation weights h and g with respect to the correlated basis y and the innovations basis ɛ, that is,

x̂ = h^T y = g^T ɛ

Hint: Use g = D^{-1} L^{-1} r and h = L^{-T} g, where D = R_ɛɛ and L = B.

11. For the previous problem, compute the optimum estimates of x based on the three successively bigger subspaces Y_1 = {y_1}, Y_2 = {y_1, y_2}, Y_3 = {y_1, y_2, y_3}, in the forms

x̂_1 = h_11 y_1 = g_11 ɛ_1
x̂_2 = h_21 y_1 + h_22 y_2 = g_21 ɛ_1 + g_22 ɛ_2
x̂_3 = h_31 y_1 + h_32 y_2 + h_33 y_3 = g_31 ɛ_1 + g_32 ɛ_2 + g_33 ɛ_3

Show that the g-weights are independent of the order, that is, g_pi = g_i, where g_i was found in the previous problem. Show that the above estimates can be recursively constructed by

x̂_1 = g_1 ɛ_1
x̂_2 = x̂_1 + g_2 ɛ_2
x̂_3 = x̂_2 + g_3 ɛ_3

Assuming σ_x^2 = 30, use the recursions E_i = E_{i-1} - g_i^2 E[ɛ_i^2], where E_i = E[e_i^2] = E[(x - x̂_i)^2], to determine the successive estimation errors E_1, E_2, E_3. Note the gradual improvement of the estimate as the number of observations is increased. Finally, determine the predictions ŷ_{2/1} and ŷ_{3/2} of y_2 and y_3 based on the past subspaces Y_1 and Y_2, respectively, write them in the forms

ŷ_{2/1} = a_21 y_1 = b_21 ɛ_1
ŷ_{3/2} = a_31 y_1 + a_32 y_2 = b_31 ɛ_1 + b_32 ɛ_2

and show that the inverse innovations matrix L^{-1} = B^{-1} can be expressed as:

L^{-1} = [[1, 0, 0], [b_21, 1, 0], [b_31, b_32, 1]]^{-1} = [[1, 0, 0], [-a_21, 1, 0], [-a_31, -a_32, 1]]
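The Gram-Schmidt construction of the factorization R = B R_ɛɛ B^T is equivalent to an LDL^T decomposition, which can be sketched numerically. The covariance matrix below is an illustrative stand-in of my choosing, not the one from the problem:

```python
# Lower-triangular innovations factorization R = B * D * B^T computed by
# Gram-Schmidt on the components of y (equivalently, LDL^T decomposition).
def ldl(R):
    n = len(R)
    B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for i in range(n):
        for j in range(i):
            s = sum(B[i][k] * B[j][k] * D[k] for k in range(j))
            B[i][j] = (R[i][j] - s) / D[j]
        D[i] = R[i][i] - sum(B[i][k] ** 2 * D[k] for k in range(i))
    return B, D

R = [[2.0, 2.0, 2.0],      # illustrative covariance matrix, not the book's
     [2.0, 5.0, 5.0],
     [2.0, 5.0, 9.0]]
B, D = ldl(R)
# verify R_ij = sum_k B_ik * D_k * B_jk
recon = [[sum(B[i][k] * D[k] * B[j][k] for k in range(3)) for j in range(3)]
         for i in range(3)]
print(B, D)
```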
12. Consider the deterministic random signal y_n = 2 cos(ω_1 n + φ), where ω_1 = π/3 and φ is a random phase distributed uniformly over the interval [0, 2π].

(a) Show that y_n satisfies a second-order homogeneous difference equation.

(b) Using the definition R(k) = E[y_{n+k} y_n], show that R(k) = 2 cos(ω_1 k).

(c) Let y = [y_0, y_1, y_2]^T be three consecutive samples. Using the result of part (b), determine the 3×3 autocorrelation matrix R = E[y y^T] and show that it has zero determinant.

(d) Because of the singularity of R, we expect the Cholesky factorization to break down at dimension 3. To see this, carry out the Gram-Schmidt orthogonalization of y, starting with y_0 and ending with y_2, and thereby determine the factorization R = B R_ɛɛ B^T. Is the result consistent with part (a)?

13. (a) Let R(k) be the autocorrelation function of a stationary random signal y_n. Express the autocorrelation matrix of the random vector y = [y_n; y_{n+k}] in terms of R(k). Then, show the general inequality

|R(k)| ≤ R(0),    for all k

(b) Let u, v be two random variables. Show the Schwarz inequality:

E[uv]^2 ≤ E[u^2] E[v^2]

Hint: Consider y = [u, v]^T.

Supplement: Probability and Statistics Problems

1. (a) Let x be a zero-mean gaussian random variable with variance σ_x^2. Show that

E[x^4] = 3σ_x^4

(b) Let x = [x_1, x_2, ..., x_N]^T be a block of mutually uncorrelated zero-mean gaussian random variables, each with variance σ_x^2. Using the above result, show

E[x_i x_j x_k x_l] = σ_x^4 (δ_ij δ_kl + δ_ik δ_jl + δ_il δ_jk)

Show also that their covariance matrix is R_xx = E[x x^T] = σ_x^2 I, where I is the N×N identity matrix.

(c) Suppose the above N random variables x are mixed up by an arbitrary invertible linear transformation y = Bx, resulting in the new set of gaussian random variables y = [y_1, y_2, ..., y_N]^T. Let R = E[y y^T] be their covariance matrix. Show that

R = σ_x^2 B B^T

(d) Show the analogous result of part (b):

E[y_i y_j y_k y_l] = R_ij R_kl + R_ik R_jl + R_il R_jk

2.
An estimate of the mean m of N independent, identically distributed random variables {y_1, y_2, ..., y_N} of variance σ^2 can be formed by the weighted sum

m̂ = h_1 y_1 + h_2 y_2 + ... + h_N y_N

Determine expressions for the mean and variance of m̂, that is, the quantities E[m̂] and var(m̂). What are the constraints on the weights h_i in order for m̂ to be an unbiased estimate of m? What are the optimal choices for these weights if, in addition, it is required that the variance var(m̂) be minimum?

3. The sample mean of N independent gaussian random variables {y_1, y_2, ..., y_N} of mean m and variance σ^2 is given by
m̂ = (1/N)(y_1 + y_2 + ... + y_N)

First, show that m̂ is unbiased and its variance is var(m̂) = σ^2/N. Then, show that the probability density of m̂ is

p(m̂) = (N^{1/2} / ((2π)^{1/2} σ)) exp[-N(m̂ - m)^2 / (2σ^2)]

Moreover, show that as N → ∞, this density converges to the deterministic delta-function density p(m̂) → δ(m̂ - m).

4. Consider N independent gaussian random variables {y_1, y_2, ..., y_N} of mean m and variance σ^2. The sample variance is defined as

σ̂^2 = (1/N) Σ_{i=1}^{N} (y_i - m̂)^2

where m̂ is the sample mean as defined above. Show that the mean and variance of the sample variance are given by

E[σ̂^2] = ((N-1)/N) σ^2,    var(σ̂^2) = (2(N-1)/N^2) σ^4

Note: This is somewhat lower than the CR lower bound 2σ^4/N. But this is no contradiction, because the CR bound applies to unbiased estimators and the above is slightly biased.

5. Continuing with the previous problem, we can form an unbiased estimator of the variance, the square of the sample standard deviation:

s^2 = (1/(N-1)) Σ_{i=1}^{N} (y_i - m̂)^2

Therefore, s^2 = N σ̂^2/(N-1). Show that its mean and variance are

E[s^2] = σ^2,    var(s^2) = 2σ^4/(N-1)

This does satisfy the CR bound.

6. Next, we determine that the probability distribution of s^2 is a χ^2-distribution with (N-1) degrees of freedom. In the definition of s^2, there are N squared terms (y_i - m̂)^2, yet we divided by (N-1), not N. These terms are not mutually independent, because of the presence of m̂. Using these dependencies, one can express s^2 as a sum of (N-1) independent squared terms, as follows.

(a) Consider the following linear transformation (known as Helmert's transformation) from the set {y_1, ..., y_N} to a new set {z_1, ..., z_N}:

z_i = c_i (y_1 + y_2 + ... + y_i - i y_{i+1}),    i = 1, 2, ..., N-1
z_N = c_N (y_1 + y_2 + ... + y_N)

Determine the scale factors c_i in order for the z_i to have unit variance.

(b) Then, show that the z_i have zero mean and are mutually uncorrelated:

E[z_i z_j] = δ_ij,    i, j = 1, 2, ..., N

(c) Then, show that the linear transformation preserves the sum of the squares,

Σ_{i=1}^{N} z_i^2 = (1/σ^2) Σ_{i=1}^{N} y_i^2

therefore, it is an orthogonal transformation.
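The two facts var(m̂) = σ^2/N and E[σ̂^2] = ((N-1)/N)σ^2 from Problems 3-4 are easy to confirm by simulation. A Monte Carlo sketch, with N, σ, and the trial count chosen arbitrarily:

```python
import random

# Monte Carlo sketch: check var(mhat) ~ sigma^2/N and the downward bias
# E[sigma_hat^2] ~ (N-1)/N * sigma^2 of the sample variance.
random.seed(1)
N, trials, sigma = 10, 20000, 2.0
means, svars = [], []
for _ in range(trials):
    y = [random.gauss(0.0, sigma) for _ in range(N)]
    mhat = sum(y) / N
    means.append(mhat)
    svars.append(sum((v - mhat) ** 2 for v in y) / N)

var_mhat = sum(m ** 2 for m in means) / trials  # true mean is 0
mean_svar = sum(svars) / trials
print(var_mhat)   # ~ sigma^2/N = 0.4
print(mean_svar)  # ~ (N-1)/N * sigma^2 = 3.6
```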
Finally, show that the sum of the first (N-1) squared terms is

χ^2 = Σ_{i=1}^{N-1} z_i^2 = (1/σ^2) Σ_{i=1}^{N} (y_i - m̂)^2

Thus, the sum of the N squared terms on the right-hand side follows a normalized χ^2-distribution with (N-1) degrees of freedom.

7. The following twenty random numbers come from an unknown probability distribution:

{0.33, 0.52, 2.4, .93, 0.46, 0.44, 0.97, 0.38, 0.48, .29, .82, .23, 0.2, 2.66, .22, 0.4, 0.95, .47, 0.83, 0.43}

Test the hypothesis that the underlying distribution is gaussian with zero mean and unit variance. To do this, perform the χ^2 test by dividing the range of the gaussian distribution into the following six bins:
(-∞, -1.5), (-1.5, -0.5), (-0.5, 0.0), (0.0, 0.5), (0.5, 1.5), (1.5, ∞)

If the i-th bin is the interval (x_{i-1}, x_i), then the theoretically expected number of observations that fall into the i-th bin will be

N_i^th = N [F(x_i) - F(x_{i-1})]

where N is the total number of observations and F(x) is the cdf of the assumed gaussian distribution, that is,

F(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-z^2/2} dz

Let N_i be the actual number of observations that fall into the i-th bin. Then, calculate the χ^2 statistic given by

χ^2 = Σ_{i=1}^{B} (N_i - N_i^th)^2 / N_i^th

where B is the number of bins; here, B = 6. This quantity follows a χ^2-distribution with B-1 degrees of freedom. Thus, its mean will be equal to the number of degrees of freedom, namely, B-1. If your calculated χ^2 is near the theoretical mean B-1, then you cannot reject the hypothesis that the pdf was gaussian. Alternatively, you can look up the 95-percent confidence interval of the χ^2-distribution with B-1 degrees of freedom, that is, the interval 0 ≤ χ^2 ≤ χ_1^2 such that the probability of a χ^2 value falling in it is 0.95, or, equivalently, the probability of a χ^2 value falling outside it is only 0.05. Then, if your calculated value of χ^2 falls within that interval, you can conclude with 95-percent confidence that the gaussian assumption cannot be rejected. Note: For B-1 = 5 degrees of freedom, we have χ_1^2 = 11.07.

8. Let F(x) be the cdf of a pdf f(x). Show that the random variable u defined by

u = F(x)

is distributed uniformly over the interval [0, 1). Therefore, random variables x following the pdf f(x) can be generated from a uniform random number generator using the inverse function x = F^{-1}(u). This is the inversion method for generating random numbers from uniform ones (see Appendix A).

9. The Rayleigh probability density finds application in fading communication channels:

p(r) = (r/σ^2) e^{-r^2/(2σ^2)},    for r ≥ 0

Using the inversion method, show how to generate a Rayleigh-distributed random variable r from a uniform variable u.

10.
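The χ^2 computation can be sketched as follows. Because the signs of the printed data did not survive transcription, the observed counts below are illustrative placeholders, not the answer to the problem:

```python
import math

# Chi-square goodness-of-fit sketch for N(0,1) with the six bins above.
def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

edges = [-math.inf, -1.5, -0.5, 0.0, 0.5, 1.5, math.inf]
N = 20
expected = [N * (Phi(b) - Phi(a)) for a, b in zip(edges, edges[1:])]

observed = [1, 4, 5, 3, 5, 2]   # placeholder counts summing to 20, not the book's data
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(expected, chi2)
# compare chi2 against the 95% point of chi-square with B-1 = 5 dof (11.07)
```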
The inversion method may also be applied to the problem of generating discrete-valued random variables. Let x be a random variable that can only take one of the discrete values {x_1, x_2, ..., x_M}, with probabilities {p_1, p_2, ..., p_M}, respectively. It is assumed, of course, that the p_i sum to unity. You have available a uniform generator in the interval [0, 1). Explain how to generate the discrete random numbers x from a uniform u.

11. You want to simulate a binary experiment in which only two outcomes can occur, one with probability p and the other with probability 1-p. For example, simulating successive throws of heads or tails, the transmission of bits 0 or 1, or an accept/reject decision. This is the same as the previous problem, with M = 2. The procedure for picking one or the other outcome can be mechanized as follows:

1. Generate a uniform u.
2. If 0 ≤ u < p, then pick the first outcome.
3. If p ≤ u < 1, then pick the second outcome.

Explain why this procedure generates the two outcomes with the correct probabilities p and 1-p.

Note: The optimization method of simulated annealing uses such two-valued random variables. It is an iterative method of minimizing a performance index J(λ), where λ is a vector of parameters with respect to which J must be minimized. Consider two successive choices of the parameter vector, λ_new and λ_old, and compute the change in the performance index: ΔJ = J(λ_new) - J(λ_old). Most iterative minimization algorithms, such as steepest descent or Newton's method, try to continuously keep decreasing J; that is, they demand that the change in λ always be such that ΔJ ≤ 0. This can easily drive λ into a local minimum of J, where the algorithm gets stuck. To alleviate this problem, the so-called Metropolis algorithm of simulated annealing allows J, on occasion, to increase, that is, ΔJ > 0, in order to
jump over such local minima and continue decreasing towards the absolute minimum. The algorithm is as follows: If ΔJ ≤ 0, then accept the change in the parameter vector, λ_old → λ_new. But if ΔJ > 0, then accept the change only with probability p = e^{-βΔJ} and reject it with probability 1-p, where β is a suitable positive constant. Using the results of this problem, it should be clear how one makes the decision of whether to accept or reject the change.

12. Consider the Box-Muller transformation

x = (-2 ln u)^{1/2} cos(2πv),    y = (-2 ln u)^{1/2} sin(2πv)

Show that if {u, v} are independent uniform random variables in the interval [0, 1), then {x, y} are two independent gaussian random variables with zero mean and unit variance.

13. Consider the generalized Box-Muller transformation

x = (-2 ln u)^{1/2} cos(2πv),    y = (-2 ln u)^{1/2} cos(2πv - φ)

where φ is a constant angle. Show that if {u, v} are independent uniform random variables in the interval [0, 1), then {x, y} are two jointly gaussian random variables with zero mean, unit variance, and correlation coefficient E[xy] = cos φ.

14. Let X_1 and X_2 be two independent random variables with cdfs F_1(x) and F_2(x). Show that the random variable X = max(X_1, X_2) has cdf F(x) = F_1(x)F_2(x). Show also that X = min(X_1, X_2) has cdf F(x) = F_1(x) + F_2(x) - F_1(x)F_2(x).

15. The inversion method of generating random variables is convenient only when the cdf F(x) is known in closed form or is easily computed. An alternative method, which works well when the pdf f(x) is known but the cdf F(x) is complicated, as in the gaussian case, is the rejection method. It requires two conditions that are not difficult to meet. First, there exists a so-called majorizing pdf g(x) such that f(x) is bounded from above by

f(x) ≤ c g(x),    for all x

where c is a given constant. Second, it is much easier to generate random variables from the distribution g(x) than from f(x). The following algorithm generates an x distributed according to f(x):
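The basic Box-Muller transformation of Problem 12 can be sketched and sanity-checked numerically (sample size and seed are my choices):

```python
import math, random

# Box-Muller sketch: map independent uniforms (u, v) to a pair of
# independent standard gaussians (x, y).
def box_muller(u, v):
    r = math.sqrt(-2.0 * math.log(u))
    return r * math.cos(2 * math.pi * v), r * math.sin(2 * math.pi * v)

random.seed(7)
pairs = []
for _ in range(50000):
    u = 1.0 - random.random()   # in (0, 1], avoids log(0)
    v = random.random()
    pairs.append(box_muller(u, v))

xs = [p[0] for p in pairs]
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs)
print(mean, var)  # close to 0 and 1
```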
1. Generate an x from the distribution g(x).
2. Generate a y which is uniformly distributed over [0, cg(x)].
3. If y ≤ f(x), then output x; else, go to step 1 and repeat.

To show that this procedure correctly generates x's that are distributed according to f(x), we must show that the conditional density of an x generated as above, given that y ≤ f(x), is equal to the desired density f(x), that is,

p(X = x | Y ≤ f(x)) = f(x)

(a) Show first that necessarily c ≥ 1 and that

p(Y ≤ f(x) | X = x) = f(x)/(c g(x))

which follows from the fact that y is uniform.

(b) Then, integrate the above over all x's generated from g(x) to get

p(Y ≤ f(x)) = 1/c

(c) Finally, use Bayes' rule to determine the quantity

p(X = x | Y ≤ f(x)) = p(Y ≤ f(x) | X = x) p(X = x) / p(Y ≤ f(x))

16. Let y be an M-dimensional gaussian random vector with zero mean and covariance matrix R. Show that the information content, or entropy, of y is given by

S = -∫ p(y) ln p(y) d^M y = (1/2) ln(det R)

up to an unimportant additive constant.

17. Let y = Bɛ be the innovations representation of an M-dimensional gaussian zero-mean vector. Show that its entropy can be written, up to an additive constant, as follows:

S = -∫ p(y) ln p(y) d^M y = (1/2) Σ_{i=1}^{M} ln E_i

where E_i = E[ɛ_i^2] are the variances of the innovations.
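The three-step rejection algorithm above can be sketched with a simple pdf of my choosing, f(x) = 2x on [0, 1], majorized by the uniform g(x) = 1 with c = 2:

```python
import random

# Rejection-method sketch: draw from f(x) = 2x on [0, 1] using the uniform
# majorizer g(x) = 1 with constant c = 2, so f(x) <= c*g(x) everywhere.
def sample_f():
    while True:
        x = random.random()            # step 1: x ~ g
        y = random.uniform(0.0, 2.0)   # step 2: y uniform on [0, c*g(x)]
        if y <= 2.0 * x:               # step 3: accept if y <= f(x)
            return x

random.seed(3)
draws = [sample_f() for _ in range(40000)]
mean = sum(draws) / len(draws)
print(mean)  # E[x] under f is 2/3
```

Note that the acceptance rate is 1/c = 1/2 here, in agreement with part (b).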
18. (a) For any two positive real numbers a and b, show the inequality

a ln(a/b) ≥ a - b

(b) Let y be an M-dimensional random vector. For any two probability densities p(y) and q(y), prove the following information inequality:

∫ p(y) ln[p(y)/q(y)] d^M y ≥ 0

with equality attained when p(y) = q(y).

19. Consider the subset of all M-dimensional probability densities p(y) that have a given mean m and covariance Σ. Show that the density from this subset that has maximum entropy,

S = -∫ p(y) ln p(y) d^M y = max

is the gaussian. Hint: Use Lagrange multipliers to enforce the given constraints. Alternatively, use the information inequality of the previous problem.

20. Let R e_i = λ_i e_i, i = 1, 2, ..., M, be the M eigenvalues and orthonormal eigenvectors of the covariance matrix of an M-dimensional random vector y. Define the M transformed random variables:

z_i = e_i^T y,    i = 1, 2, ..., M

(a) Show that they are mutually uncorrelated with variances λ_i, that is, E[z_i z_j] = λ_i δ_ij.

(b) Show that y can be expanded in terms of the z_i as follows:

y = Σ_{i=1}^{M} z_i e_i

Thus, the randomness of y arises only from the randomness of the z_i, which are uncorrelated. If the eigenvalues are arranged in decreasing order and the first L largest eigenvalues are dominant, then the sum may be approximated by

y ≈ Σ_{i=1}^{L} z_i e_i

Thus, the M-vector y is represented by only L < M parameters, namely, z_1, z_2, ..., z_L. This approximation forms the basis of data compression using the Karhunen-Loeve transform.

(c) Show the equality of quadratic forms

y^T R^{-1} y = Σ_{i=1}^{M} z_i^2/λ_i

(d) Determine the pdf p_z(z) of the vector z = [z_1, z_2, ..., z_M]^T in terms of the pdf p_y(y) (do not assume gaussian distributions). Show that the information content of y is the same as that of z, in the sense that they have equal entropies.

(e) If we denote by B the modal matrix of R, that is, the matrix whose columns are the eigenvectors e_i,

B = [e_1, e_2, ..., e_M]

then show that y is related to the z-basis as y = Bz. Show also that B satisfies B B^T = B^T B = I, and that R = B D B^T, with D = diag(λ_1, λ_2, ..., λ_M).
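The eigen-expansion R = B D B^T of Problem 20 can be sketched in the 2×2 case, where the eigenvectors can be written down by hand. The covariance matrix is an illustrative choice of mine:

```python
import math

# Karhunen-Loeve sketch: for a symmetric 2x2 covariance R, the orthonormal
# eigenvector basis diagonalizes it: R = sum_i lam_i * e_i * e_i^T.
R = [[3.0, 1.0], [1.0, 3.0]]         # illustrative covariance (my choice)
# eigenvectors of this R: (1,1)/sqrt(2) and (1,-1)/sqrt(2)
s = 1.0 / math.sqrt(2.0)
B = [[s, s], [s, -s]]                 # columns are e_1, e_2
lam = [4.0, 2.0]                      # eigenvalues: R e_i = lam_i e_i

# reconstruct R = sum_i lam_i * e_i e_i^T
recon = [[sum(lam[k] * B[i][k] * B[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
print(recon)
```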
332:525 Solutions

1. Differentiating E_n with respect to m̂ and setting the gradient to zero gives:

∂E_n/∂m̂ = -2 Σ_{k=0}^{n} (x_k - m̂) = 0

which has solution:

m̂_n = (1/(n+1)) Σ_{k=0}^{n} x_k

In part (b), the required recursions were shown in class. For part (c), we take expectations of both sides of the definition of m̂_n to get:

E[m̂_n] = (1/(n+1)) Σ_{k=0}^{n} E[x_k] = (1/(n+1))(n+1)m = m

Next, we have:

m̂_n - m = (1/(n+1)) Σ_{k=0}^{n} (x_k - m)

The variance of m̂_n will then be

E[(m̂_n - m)^2] = (1/(n+1))^2 Σ_{k=0}^{n} Σ_{j=0}^{n} E[(x_k - m)(x_j - m)]

And, using the iid assumption, we have E[(x_k - m)(x_j - m)] = σ_x^2 δ_kj, which gives for the variance of m̂_n:

E[(m̂_n - m)^2] = (1/(n+1))^2 Σ_{k=0}^{n} Σ_{j=0}^{n} σ_x^2 δ_kj = (1/(n+1))^2 (n+1) σ_x^2 = σ_x^2/(n+1)

2. The gradient of the performance index is now:

∂E_n/∂m̂ = -2 Σ_{k=0}^{n} λ^{n-k} (x_k - m̂) = 0

with optimum solution:

m̂_n = Σ_{k=0}^{n} λ^{n-k} x_k / Σ_{k=0}^{n} λ^{n-k} = (x_n + λx_{n-1} + λ^2 x_{n-2} + ... + λ^n x_0)/(1 + λ + λ^2 + ... + λ^n)

Using the finite geometric series, we may write the denominator as

Σ_{k=0}^{n} λ^{n-k} = 1 + λ + λ^2 + ... + λ^n = (1 - λ^{n+1})/(1 - λ)

which gives for the estimator m̂_n:

m̂_n = ((1-λ)/(1-λ^{n+1})) Σ_{k=0}^{n} λ^{n-k} x_k

Replacing n by n-1 and multiplying by a factor of λ gives:

λ m̂_{n-1} = ((1-λ)/(1-λ^n)) Σ_{k=0}^{n-1} λ^{n-k} x_k

Thus, we can express the sum up to k = n-1 in terms of m̂_{n-1}:

(1-λ) Σ_{k=0}^{n-1} λ^{n-k} x_k = λ(1-λ^n) m̂_{n-1}

Therefore, we obtain the recursion for m̂_n:

m̂_n = ((1-λ)/(1-λ^{n+1})) (Σ_{k=0}^{n-1} λ^{n-k} x_k + x_n) = ((λ - λ^{n+1})/(1-λ^{n+1})) m̂_{n-1} + ((1-λ)/(1-λ^{n+1})) x_n

which can be written in the predictor/corrector form:

m̂_n = m̂_{n-1} + ((1-λ)/(1-λ^{n+1})) (x_n - m̂_{n-1})

In the limit λ → 1, the Kalman gain coefficient tends to the expected limit:

lim_{λ→1} (1-λ)/(1-λ^{n+1}) = 1/(n+1)

On the other hand, if λ is strictly less than one, then the term λ^{n+1} can be ignored after a few iterations, and therefore the recursion becomes essentially the first-order smoother:

m̂_n = m̂_{n-1} + (1-λ)(x_n - m̂_{n-1}) = λ m̂_{n-1} + (1-λ) x_n
3. The difference equation m̂_n = λ m̂_{n-1} + (1-λ) x_n can be solved, assuming zero initial conditions, by convolving the x_n sequence with the filter sequence (1-λ)λ^n. This gives:

m̂_n = (1-λ) Σ_{k=0}^{n} λ^{n-k} x_k

Taking expectations of both sides and using the finite geometric series, we obtain:

E[m̂_n] = (1-λ) Σ_{k=0}^{n} λ^{n-k} m = (1 - λ^{n+1}) m

which tends to m for large n. Thus, m̂_n is asymptotically unbiased. Subtracting the mean E[m̂_n] from m̂_n also gives:

m̂_n - E[m̂_n] = (1-λ) Σ_{k=0}^{n} λ^{n-k} (x_k - m)

Using the same sort of calculation as in Problem 1, we obtain for the variance of m̂_n:

E[(m̂_n - E[m̂_n])^2] = (1-λ)^2 Σ_{k=0}^{n} Σ_{j=0}^{n} λ^{n-k} λ^{n-j} E[(x_k - m)(x_j - m)]
= (1-λ)^2 Σ_{k=0}^{n} Σ_{j=0}^{n} λ^{n-k} λ^{n-j} σ_x^2 δ_kj = σ_x^2 (1-λ)^2 Σ_{k=0}^{n} λ^{2(n-k)}
= σ_x^2 (1-λ)^2 (1 - λ^{2(n+1)})/(1 - λ^2) = ((1-λ)/(1+λ)) σ_x^2 (1 - λ^{2(n+1)})

which in the limit of large n converges to the required result.

4. The theoretical gradient is:

∂E/∂m̂ = -2E[(x_n - m̂)] = -2(m - m̂)

Thus, it vanishes when m̂ = m. The instantaneous gradient is obtained by dropping the expectation value, that is,

∂E/∂m̂ = -2(x_n - m̂)

Putting this into the LMS updating equation gives:

m̂_{n+1} = m̂_n + Δm̂_n = m̂_n - μ ∂E/∂m̂_n = m̂_n + 2μ(x_n - m̂_n)

Setting λ = 1 - 2μ, we rewrite the difference equation as

m̂_{n+1} = m̂_n + 2μ(x_n - m̂_n) = (1 - 2μ) m̂_n + 2μ x_n = λ m̂_n + (1-λ) x_n

5. Problem 1.9: Using x = s + n_1 = s + F n_2 and y = n_2, we find R_yy = E[y^2] = E[n_2^2] and R_xy = E[xy] = E[(s + F n_2) n_2] = F E[n_2^2]. The optimal canceler will be H = R_xy R_yy^{-1} = F E[n_2^2] E[n_2^2]^{-1} = F. The corresponding optimum estimate will be x̂ = Hy = F n_2, and the estimation error e = x - x̂ = (s + F n_2) - F n_2 = s.

Problem 1.10: First determine H. Noting that y = n_2 + ɛs = F^{-1} n_1 + ɛs and using the definition of the gain G, we find R_yy and R_xy:

R_yy = E[yy] = F^{-2} E[n_1^2] + ɛ^2 E[s^2] = (F^{-2} + ɛ^2 G) E[n_1^2]
R_xy = E[xy] = F^{-1} E[n_1^2] + ɛ E[s^2] = (F^{-1} + ɛG) E[n_1^2]

Therefore,

H = R_xy R_yy^{-1} = (F^{-1} + ɛG)/(F^{-2} + ɛ^2 G) = F(1 + ɛFG)/(1 + ɛ^2 F^2 G)

The error output will be

e = x - x̂ = x - Hy = s + n_1 - H(F^{-1} n_1 + ɛs) = (1 - ɛH)s + (1 - H F^{-1}) n_1
Thus, the coefficients a and b will be

a = 1 - ɛH = 1 - ɛF(1 + ɛFG)/(1 + ɛ^2 F^2 G) = (1 + ɛ^2 F^2 G - ɛF - ɛ^2 F^2 G)/(1 + ɛ^2 F^2 G) = (1 - ɛF)/(1 + ɛ^2 F^2 G)
b = 1 - H F^{-1} = (1 + ɛ^2 F^2 G - 1 - ɛFG)/(1 + ɛ^2 F^2 G) = -ɛFG (1 - ɛF)/(1 + ɛ^2 F^2 G) = -ɛFG a

If the coefficient ɛ is known in advance, then the pre-processed signals will be

x' = x = s + n_1 = s + F n_2
y' = y - ɛx = n_2 + ɛs - ɛs - ɛF n_2 = (1 - ɛF) n_2

Thus, y' is correlated only with the noise part of x'. We find

E[x' y'] = F(1 - ɛF) E[n_2^2]
E[y' y'] = (1 - ɛF)^2 E[n_2^2]

and, therefore,

H' = E[x' y'] E[y' y']^{-1} = F/(1 - ɛF)

The corresponding estimate and error output are

x̂ = H' y' = (F/(1 - ɛF)) (1 - ɛF) n_2 = F n_2
e = x - x̂ = s + F n_2 - F n_2 = s

so the noise component of x is canceled completely.

6. For part (a), we have

E[y y^T] = B E[z z^T] B^T  ⟹  E[y y^T]^{-1} = B^{-T} E[z z^T]^{-1} B^{-1}

and, similarly, E[xy] = B E[xz]. The optimal Wiener weights with respect to the two bases are:

h = E[y y^T]^{-1} E[xy],    g = E[z z^T]^{-1} E[xz]

Therefore, they are related by

h = E[y y^T]^{-1} E[xy] = B^{-T} E[z z^T]^{-1} B^{-1} B E[xz] = B^{-T} g

or, h^T = g^T B^{-1}. It follows that the optimal estimate x̂ is invariant under a change of basis:

x̂ = h^T y = g^T B^{-1} B z = g^T z

Parts (b) and (c) were done in class.

7. The estimation error is e = x - x̂ = x - h^T y - b. The minimization conditions for the performance index E = E[e^2] are

∂E/∂h = 2E[e ∂e/∂h] = -2E[e y] = 0
∂E/∂b = 2E[e ∂e/∂b] = -2E[e] = 0

which are equivalent to

E[e y] = E[(x - y^T h - b) y] = E[xy] - E[y y^T] h = r - Rh = 0
E[e] = E[x - h^T y - b] = E[x] - h^T E[y] - b = m - b = 0

where we used E[y] = 0. Thus, h = R^{-1} r and b = m.

8. Part (a) follows from part (b) with the choice Q = I. For part (b), we have

R_zy = E[z y^T] = Q E[y y^T] = Q R_yy  ⟹  H = R_zy R_yy^{-1} = Q

It follows that ẑ = Hy = Qy = z. Part (c) can be shown as follows: note that the subvector y_1 can be obtained from the full vector y by the projection matrix

y_1 = [I, 0][y_1; y_2] = Qy

where I is the identity matrix of the same dimension as y_1. Using part (b) with z = y_1, we find ŷ_1 = y_1. This result can also be shown directly, as follows.
Using the notation R_ij = E[y_i y_j^T], for i, j = 1, 2, we have

E[y_1 y^T] = E[y_1 [y_1^T, y_2^T]] = [E[y_1 y_1^T], E[y_1 y_2^T]] = [R_11, R_12]
E[y y^T] = [[R_11, R_12], [R_21, R_22]]

But noting that

[R_11, R_12] = [I, 0] [[R_11, R_12], [R_21, R_22]]

we obtain

H = E[y_1 y^T] E[y y^T]^{-1} = [R_11, R_12] [[R_11, R_12], [R_21, R_22]]^{-1} = [I, 0]

Thus, ŷ_1 = Hy = [I, 0][y_1; y_2] = y_1.
9. Using part (a) of the previous problem, we have ŷ = y. Therefore, x̂ = c^T ŷ = c^T y and e = x - x̂ = c^T y + v - c^T y = v.

If the estimation is based only on the subvector y_1, then we have ŷ_1 = y_1, and therefore,

x̂_1 = c_1^T ŷ_1 + c_2^T ŷ_2 = c_1^T y_1 + c_2^T ŷ_{2/1}

and for the error output

e_1 = x - x̂_1 = c_1^T y_1 + c_2^T y_2 + v - c_1^T y_1 - c_2^T ŷ_{2/1} = v + c_2^T (y_2 - ŷ_{2/1})

Setting e_2 = y_2 - ŷ_{2/1}, we have e_1 = v + c_2^T e_2, and

E = E[e_1^2] = σ_v^2 + c_2^T E[e_2 e_2^T] c_2

But E[e_2 e_2^T] = R_22 - R_21 R_11^{-1} R_12, which also shows the non-negativity property.

10. Going through the Gram-Schmidt orthogonalization procedure, we find the matrices B and D = R_ɛɛ:

B = [[1, 0, 0], [2, 1, 0], [2, 2, 1]],    D = diag(2, 1, 2)

We also need the inverses

B^{-1} = [[1, 0, 0], [-2, 1, 0], [2, -2, 1]],    R^{-1} = B^{-T} D^{-1} B^{-1}

Thus, the innovations basis is ɛ = B^{-1} y:

ɛ_1 = y_1,    ɛ_2 = y_2 - 2y_1,    ɛ_3 = y_3 - 2y_2 + 2y_1

and conversely, y = Bɛ:

y_1 = ɛ_1,    y_2 = ɛ_2 + 2ɛ_1,    y_3 = ɛ_3 + 2ɛ_2 + 2ɛ_1

For the estimation part, we calculate the h and g weights using the formulas

g = D^{-1} B^{-1} r = [2, -4, 1]^T,    h = B^{-T} g = R^{-1} r = [12, -6, 1]^T

11. The three g-weights are the optimal weights for the lower-order estimation problems, that is,

x̂_1 = g_1 ɛ_1 = 2ɛ_1
x̂_2 = g_1 ɛ_1 + g_2 ɛ_2 = 2ɛ_1 - 4ɛ_2
x̂_3 = g_1 ɛ_1 + g_2 ɛ_2 + g_3 ɛ_3 = 2ɛ_1 - 4ɛ_2 + ɛ_3

Replacing the ɛ_i in terms of the y_i, we get

x̂_1 = 2y_1
x̂_2 = 2y_1 - 4(y_2 - 2y_1) = 10y_1 - 4y_2
x̂_3 = 10y_1 - 4y_2 + (y_3 - 2y_2 + 2y_1) = 12y_1 - 6y_2 + y_3

For the mean-square errors, using the variances of the ɛ_i, {E_1, E_2, E_3} = {2, 1, 2}, and starting with E_0 = σ_x^2 = 30, we get

E_1 = E_0 - g_1^2 E[ɛ_1^2] = 30 - 2^2 · 2 = 22
E_2 = E_1 - g_2^2 E[ɛ_2^2] = 22 - (-4)^2 · 1 = 6
E_3 = E_2 - g_3^2 E[ɛ_3^2] = 6 - 1^2 · 2 = 4

For the prediction part, we want to show that the a_ij coefficients are the matrix elements of B^{-1}. This can be seen in general by writing the expressions of the ɛ_i in terms of the y_i as follows:

ɛ_1 = y_1
ɛ_2 = y_2 - ŷ_{2/1} = y_2 - a_21 y_1
ɛ_3 = y_3 - ŷ_{3/2} = y_3 - a_32 y_2 - a_31 y_1

which is equivalent to ɛ = B^{-1} y with

B^{-1} = [[1, 0, 0], [-a_21, 1, 0], [-a_31, -a_32, 1]]
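The error-recursion arithmetic with E_0 = 30 can be verified in a few lines (a sketch; the variable names are mine):

```python
# Cross-check of the recursion E_i = E_{i-1} - g_i^2 * E[eps_i^2]
# with g-weights [2, -4, 1] and innovation variances [2, 1, 2].
g = [2.0, -4.0, 1.0]
Dvar = [2.0, 1.0, 2.0]     # E[eps_i^2]
E = [30.0]                 # E_0 = sigma_x^2
for gi, di in zip(g, Dvar):
    E.append(E[-1] - gi ** 2 * di)
print(E)  # [30.0, 22.0, 6.0, 4.0]
```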
12. The difference equation is

y_n - 2 cos ω_1 · y_{n-1} + y_{n-2} = 0

Indeed, using y_n = A cos(ω_1 n + φ), we have

2 cos ω_1 · y_{n-1} = 2A cos ω_1 cos(ω_1(n-1) + φ) = A cos(ω_1 n + φ) + A cos(ω_1 n - 2ω_1 + φ) = y_n + y_{n-2}

where we used the trig identity 2 cos a cos b = cos(a+b) + cos(a-b). Using this trig identity again, we obtain for the autocorrelation function:

R(k) = E[y_{n+k} y_n] = A^2 E[cos(ω_1 n + ω_1 k + φ) cos(ω_1 n + φ)]
= (1/2) A^2 E[cos(2ω_1 n + ω_1 k + 2φ) + cos(ω_1 k)] = (1/2) A^2 cos(ω_1 k)

where the first expectation value is zero, as follows from the property

E[cos(φ + θ)] = (1/2π) ∫_0^{2π} cos(φ + θ) dφ = 0

for φ uniform over [0, 2π) and θ deterministic. The 3×3 autocorrelation matrix will be R_ij = E[y_i y_j] = R(i-j). Noting that R(i-j) = R(j-i), we find

R = [[R(0), R(1), R(2)], [R(1), R(0), R(1)], [R(2), R(1), R(0)]]
  = (1/2) A^2 [[1, cos ω_1, cos 2ω_1], [cos ω_1, 1, cos ω_1], [cos 2ω_1, cos ω_1, 1]]

Its determinant is

det R = (1/8) A^6 [1 + 2 cos^2 ω_1 cos 2ω_1 - cos^2 2ω_1 - 2 cos^2 ω_1]

Using the trig identity cos 2ω_1 = 2 cos^2 ω_1 - 1, we can verify that the expression in brackets vanishes. The same result also follows from the observation that the rank of R is two, not three, because each column can be expressed as a linear combination of the other two; for example, the first column is expressible as

[1; cos ω_1; cos 2ω_1] = 2 cos ω_1 [cos ω_1; 1; cos ω_1] - [cos 2ω_1; cos ω_1; 1]

The Gram-Schmidt construction proceeds as follows:

ɛ_0 = y_0
ɛ_1 = y_1 - b_10 ɛ_0
ɛ_2 = y_2 - b_20 ɛ_0 - b_21 ɛ_1

where

b_10 = E[y_1 ɛ_0]/E[y_0 y_0] = R(1)/R(0) = cos ω_1

The quantity E_1 = E[ɛ_1^2] is calculated by squaring the expression y_1 = ɛ_1 + b_10 ɛ_0, taking expectations of both sides, and using E_0 = E[ɛ_0^2] = R(0):

R(0) = E[y_1^2] = E_1 + b_10^2 E_0

or,

E_1 = R(0) - b_10^2 E_0 = R(0)(1 - b_10^2) = R(0)(1 - cos^2 ω_1) = R(0) sin^2 ω_1

Similarly, we find

b_20 = E[y_2 ɛ_0]/E_0 = R(2)/R(0) = cos 2ω_1
b_21 = E[y_2 ɛ_1]/E_1 = E[y_2 (y_1 - b_10 y_0)]/E_1 = (R(1) - b_10 R(2))/E_1
     = (cos ω_1 - cos ω_1 cos 2ω_1)/sin^2 ω_1 = cos ω_1 (1 - cos 2ω_1)/sin^2 ω_1 = cos ω_1 (2 sin^2 ω_1)/sin^2 ω_1 = 2 cos ω_1

Thus, the B matrix will be
        [ 1          0          0 ]
    B = [ cos ω_1    1          0 ]
        [ cos 2ω_1   2 cos ω_1  1 ]

The prediction error E_2 is expected to be zero because y_2 can be predicted exactly from {y_0, y_1}, as follows from the difference equation applied with n = 2:

    y_2 − 2 cos ω_1 · y_1 + y_0 = 0

Indeed, squaring the equation y_2 = ε_2 + b_{20} ε_0 + b_{21} ε_1 and taking expectations of both sides, we get

    R(0) = E[y_2^2] = E_2 + b_{20}^2 E_0 + b_{21}^2 E_1

and solving for E_2,

    E_2 = R(0) − b_{20}^2 E_0 − b_{21}^2 E_1
        = R(0) − cos^2 2ω_1 · R(0) − 4 cos^2 ω_1 · sin^2 ω_1 R(0)
        = (1 − cos^2 2ω_1) R(0) − 4 cos^2 ω_1 sin^2 ω_1 R(0)
        = sin^2 2ω_1 R(0) − sin^2 2ω_1 R(0) = 0

Thus, the D matrix will be

        [ E_0  0    0   ]        [ 1  0          0 ]
    D = [ 0    E_1  0   ] = R(0) [ 0  sin^2 ω_1  0 ]
        [ 0    0    E_2 ]        [ 0  0          0 ]

Finally, one should be able to verify the Cholesky factorization R = B D B^T, which in this case reads as follows (we removed an overall factor of R(0)):

    [ 1         cos ω_1   cos 2ω_1 ]   [ 1         0          0 ] [ 1  0          0 ] [ 1  cos ω_1  cos 2ω_1  ]
    [ cos ω_1   1         cos ω_1  ] = [ cos ω_1   1          0 ] [ 0  sin^2 ω_1  0 ] [ 0  1        2 cos ω_1 ]
    [ cos 2ω_1  cos ω_1   1        ]   [ cos 2ω_1  2 cos ω_1  1 ] [ 0  0          0 ] [ 0  0        1         ]

13. Part (a) follows from part (b) and stationarity. Indeed, applying part (b) to u = y_{n+k} and v = y_n,

    E[y_{n+k} y_n]^2 ≤ E[y_{n+k}^2] E[y_n^2]   ⇒   R(k)^2 ≤ R(0)R(0),  or,  |R(k)| ≤ R(0)

Part (b) can be derived as follows: the autocorrelation matrix of y = [u, v]^T is

    R = E[y y^T] = E[ [u]  [u, v] ] = [ E[u^2]  E[uv]  ]
                     [v]              [ E[vu]   E[v^2] ]

Because this matrix is positive semi-definite, its determinant will be non-negative, that is,

    det R = E[u^2] E[v^2] − E[uv]^2 ≥ 0
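A small Monte-Carlo sketch illustrates both results for the random-phase sinusoid: the estimated autocorrelation matches (1/2) A^2 cos(ω_1 k) and satisfies |R(k)| ≤ R(0), while the difference equation predicts y_2 from y_0, y_1 with zero error for any phase. The values A = 2, ω_1 = 0.9, the sample count, and the seed are assumptions made for the demonstration.

```python
import math
import random

# Assumed demonstration values (not from the problem):
A, w1, N = 2.0, 0.9, 100_000
random.seed(0)

def sample_R(k):
    """Monte-Carlo estimate of R(k) = E[y_{n+k} y_n] for y_n = A cos(w1 n + phi),
    with phi uniform on [0, 2*pi); by stationarity we may take n = 0."""
    acc = 0.0
    for _ in range(N):
        phi = random.uniform(0.0, 2.0 * math.pi)
        acc += A * math.cos(w1 * k + phi) * A * math.cos(phi)
    return acc / N

est = {k: sample_R(k) for k in range(4)}   # estimates of R(0), ..., R(3)
print(est)

# The difference equation predicts y_2 exactly from y_0, y_1 for any phase:
phi = 0.37   # arbitrary fixed phase
y = [A * math.cos(w1 * n + phi) for n in range(3)]
resid = y[2] - (2.0 * math.cos(w1) * y[1] - y[0])
print(resid)
```

The residual of the deterministic prediction is zero to machine precision, which is the numerical counterpart of E_2 = 0 above.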